Feature Selection: An Exploration of Algorithm Performance

Jason Case, Abhishek Dugar, Daniel Nkuah, Khoa Tran

2024-07-30

Introduction

What is feature selection? Why do we care?

Introduction: Early Techniques

  • Forward, backward, and stepwise variable selection in linear models

  • Univariate screening procedures

Introduction: Modern Techniques

Advanced feature selection methods:

  • Similarity-based
  • Information-theoretic
  • Sparse-learning-based
  • Statistical-based

Classification of methods

  • Filter
  • Wrapper
  • Embedded
  • Hybrid

Methods: Correlation-Based Feature Selection (CFS)

  • Filter method
  • Scores features by their correlation with the target while penalizing correlation among the features themselves.

CFS’s feature subset evaluation function is:

\(M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\overline{r_{ff}}}}\)

where \(M_S\) is the heuristic "merit" of a feature subset \(S\) containing \(k\) features, \(\overline{r_{cf}}\) is the mean feature-class correlation, and \(\overline{r_{ff}}\) is the mean feature-feature intercorrelation.
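As a concrete illustration, the merit score can be computed directly with NumPy. The function and toy data below are hypothetical, and plain Pearson correlations stand in for the symmetric-uncertainty measures sometimes used in CFS implementations.

```python
import numpy as np

def cfs_merit(X, y):
    """Merit of the feature subset given by the columns of X, for target y:
    k * mean(feature-class corr) / sqrt(k + k(k-1) * mean(feature-feature corr))."""
    k = X.shape[1]
    # mean absolute feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(k)])
    if k > 1:
        # mean absolute off-diagonal (feature-feature) correlation
        corr = np.abs(np.corrcoef(X, rowvar=False))
        r_ff = (corr.sum() - k) / (k * (k - 1))
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
print(round(cfs_merit(X, y), 3))
```

Note that the denominator grows with the average intercorrelation, so redundant subsets score lower even when each member correlates well with the class.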

Methods: Recursive Feature Elimination (RFE)

  • Wrapper method
  • Removes variables iteratively.

Steps:

  1. Train the classifier
  2. Compute the ranking criterion for all features
  3. Remove the feature(s) with the smallest ranking values

For an unknown input vector x, a linear classifier takes the form

\(y(x) = \operatorname{sign}(w \cdot x - b)\)

For LS-SVM, substituting the KKT conditions into the Lagrangian gives the objective function

\(L = -\frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j \left(K(x_i, x_j) + \frac{\delta_{ij}}{C}\right) + \sum_{i} \alpha_i y_i\)

where \(\delta_{ij} = 1\) if \(i = j\) and \(0\) otherwise (the Kronecker delta), and \(C\) is the regularization parameter.

The ranking criterion can be obtained by:

\(D(L^{-m}) = -\frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j \left(K(x_i, x_j) - K(x_i^{-m}, x_j^{-m})\right)\)

where \(x_i^{-m}\) and \(x_j^{-m}\) are the input vectors with the \(m\)-th feature removed.
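The iterative procedure above is implemented in scikit-learn's RFE. The sketch below, on synthetic data with hypothetical parameter choices, ranks features by the weights of a linear SVM and removes the lowest-ranked feature each round.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic binary task: 20 features, only 5 informative
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# The linear SVM supplies coef_; RFE drops the lowest-ranked
# feature each round (step=1) until 5 remain
selector = RFE(LinearSVC(dual=False, max_iter=5000),
               n_features_to_select=5, step=1)
selector.fit(X, y)
print(selector.support_.sum())  # number of features kept
```

Larger `step` values remove several features per round, trading ranking fidelity for speed on wide datasets.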

Methods: Least Absolute Shrinkage and Selection Operator (LASSO)

  • Embedded method
  • Uses L1 regularization to shrink some coefficients exactly to zero.

The LASSO estimate is defined by the solution to the \(\ell_1\)-constrained optimization problem

minimize \(\frac{1}{n}\| Y - X\beta \|_2^2\) subject to \(\sum_{j=1}^{k} |\beta_j| \le t\)

where \(t\) is an upper bound on the sum of the absolute values of the coefficients.
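In practice the constrained problem is solved in its Lagrangian form, where a penalty weight `alpha` plays the role of the bound \(t\). A minimal sketch on hypothetical data, using scikit-learn's Lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, k = 200, 10
X = rng.normal(size=(n, k))
# Only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

# alpha is the Lagrangian counterpart of the bound t above;
# large alpha (small t) zeroes out more coefficients
model = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(model.coef_ != 0)
print(selected)
```

The indices with nonzero coefficients are the selected features, which is what makes LASSO an embedded method: selection falls out of fitting.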

Methods: CFS & RFE

  • Hybrid method
  • Combines filter and wrapper methods

Step 1: Filter using the correlation coefficient (i.e., trim the "low-hanging fruit").

Step 2: Remove variables iteratively using RFE.
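The two steps can be chained in a few lines. The sketch below is illustrative (synthetic data, a 20% filter cutoff, and logistic regression as the wrapper model are all assumptions, not the paper's exact pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=40,
                           n_informative=5, random_state=0)

# Step 1 (filter): drop the 20% of features least correlated with y
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = corr >= np.quantile(corr, 0.20)
X_filtered = X[:, keep]

# Step 2 (wrapper): run RFE on the surviving features only
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X_filtered, y)
print(X_filtered.shape[1], selector.support_.sum())
```

The cheap filter pass shrinks the search space so the expensive wrapper pass refits far fewer models.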

Analysis: Spambase Dataset

  • 4601 instances of emails
  • 57 features for classification tasks
  • Binary classification: email is spam (1) or not (0)
  • 80/20 train/test split
  • Increased number of features by adding all two-way interactions.

Analysis: COVID-19 NLP Text Classification Dataset

  • 45k tweets related to COVID-19, labeled for sentiment analysis.
  • Five sentiment label classes, ranging from extremely positive to extremely negative
  • Recoded to a binary classification task, positive (1) or negative (0)
  • 33,444 (91%) training and 3,179 (9%) testing records after recoding

“Bag of words”

  • text data converted into a matrix of word frequencies
  • each row represents a document
  • each column represents a unique word from the entire corpus
  • Large, sparse feature set
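The conversion above is what scikit-learn's CountVectorizer does; the tiny corpus below is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "masks help stop the spread",
    "stop hoarding masks",
    "great news on the vaccine",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse document-term matrix
print(X.shape)               # (n_documents, n_unique_words)
print(sorted(vec.vocabulary_)[:3])
```

Each column is one vocabulary word, so a corpus of tens of thousands of tweets yields the large, sparse feature set noted above.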

Statistical Modeling

Three metrics:

  • Accuracy on the test set

  • difference between accuracy on the training and test set (overtraining)

  • number of variables selected (model complexity)

Baseline: full logistic model with no feature selection.

CFS: select the best of 20 correlation thresholds using cross-validation.

RFE: select the best of 20 subset sizes using cross-validation.

LASSO: select the best penalty term using cross-validation.

CFS + RFE: remove the 20% of variables with the lowest correlation, then select the best of 20 subset sizes using cross-validation.
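Choosing the RFE subset size by cross-validation can be sketched with scikit-learn's RFECV, which scores every candidate size rather than a grid of 20; the data and settings below are illustrative, not the study's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=30,
                           n_informative=6, random_state=0)

# 5-fold CV picks the subset size with the best mean test accuracy
selector = RFECV(LogisticRegression(max_iter=1000),
                 step=1, cv=5, scoring="accuracy")
selector.fit(X, y)
print(selector.n_features_)  # CV-selected number of features
```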

Results: Spambase Dataset

Method       Features   Test Accuracy   Accuracy Drop from Train
Baseline         1653           0.915                      0.070
CFS               126           0.936                     -0.004
RFE              1144           0.911                      0.067
LASSO              50           0.903                     -0.006
CFS + RFE         683           0.920                      0.045

Results: COVID-19 NLP Text Classification Dataset

Method       Features   Test Accuracy   Accuracy Drop from Train
Baseline         4820           0.846                      0.097
CFS              1449           0.868                      0.042
RFE              2980           0.863                      0.059
LASSO            4472           0.880                      0.045
CFS + RFE        2672           0.778                      0.088

Results: Records per Feature

Summary

Conclusion